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Abstract 

Background: Counting /c-mers (substrings of length k in DNA sequence data) is an essential component of many 
methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and 
for error correction of sequence reads. Although simple in principle, counting /c-mers in large modern sequence 
data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction- 
often more than 50%-of the storage capacity may be spent on storing /c-mers that contain sequencing errors and 
which are typically observed only a single time in the data. These singleton /c-mers are uninformative for many 
algorithms without some kind of error correction. 

Results: We present a new method that identifies all the /c-mers that occur more than once in a DNA sequence 
data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k- 
mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through 
the data to provide exact counts of all nonunique /c-mers. For example data sets, we report up to 50% savings in 
memory usage compared to current software, with modest costs in computational speed. This approach may 
reduce memory requirements for any algorithm that starts by counting /c-mers in sequence data with errors. 

Conclusions: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. 
It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html 



Background 

With recently-developed methods for massively parallel 
DNA sequencing it is now practical for individual labs 
to perform whole-genome or transcriptome sequencing 
of a wide variety of organisms, and to perform metage- 
nomic sequencing of environmental samples. Addition- 
ally, these new sequencing technologies are becoming 
widely used for reduced representation sequencing and 
genotyping of non-model organisms [1,2], including 
those with no available genome sequence. 

Each of these applications involves de novo assembly 
from very large numbers of short reads. Despite pro- 
gress in recent years, de novo assembly remains a com- 
putationally challenging task. The current research for 
assembly with short reads is focused on de Bruijn graph 
methods [3-7]. The nodes in a de Bruijn graph are the 
/c-mers of a pre-specified length k that are contained 
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within the sequencing reads. Two /c-mers are connected 
in the graph if they are adjacent in at least one sequen- 
cing read. Although de Bruijn graphs provide a nice 
conceptual framework that cuts down on computation 
time, the size of the graph can be very large, typically 
including billions of /c-mers for vertebrate-sized 
genomes. 

In order to deal with the computational challenges of 
working with such large data sets, a number of methods 
have been proposed for storing /c-mers efficiently. Most 
de Bruijn graph assemblers store /c-mers using 2 bits to 
encode each nucleotide, so that each Zc-mer | takes 
bytes. The /c-mers are then stored in a hash table, 
usually with some associated information such as cover- 
age and neighborhood information in the de Bruijn 
graph. The exact memory usage depends on the hash 
table used; for example, the assembly software ABySS 
[6] uses the Google sparsehash library, which has mini- 
mal memory overhead http://code.google.com/p/google- 
sparsehash/. Additionally, ABySS can share the memory 
load across multiple machines, splitting up the hash 
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table so that each potential k-mer is assigned to a 
unique machine, although this setup has more commu- 
nication overhead across machines and requires addi- 
tional work by the end user. A recently-developed 
program named Jellyfish is specifically designed for /c- 
mer counting (for A:-mers of up to 32 bp) [8]. It uses a 
"quotienting" technique [9] to reduce the space needed 
to store each k-mer in a hash table, and it achieves 
much lower memory usage than other available meth- 
ods. Additionally, [10] show how to compress both the 
de Bruijn graph and the k-mei coverage counts to nearly 
the optimal. However this compression is done after all 
the Zr-mers have been counted, in contrast to Jellyfish. 

A complementary strategy for reducing memory usage 
is based on the observation that in current data sets, a 
large fraction of the observed Ar-mers may arise from 
sequencing errors. Most of these occur uniquely in the 
data, and hence they greatly increase the memory 
requirements of de novo assembly without adding much 
information. For this reason, it is frequently helpful to 
either discard unique Aimers prior to building the 
graph, or to attempt to correct them if they are similar 
to other, much more abundant, /c-mers [11-14], For 
example, the team that sequenced the giant panda gen- 
ome obtained 56-fold coverage of the 2.4 GB genome 
on the Illumina sequencing platform [11]. Using a 
supercomputer with 512 GB of RAM, the authors 
counted a total of 8.62 billion 27-mers. After removing 
or correcting low-coverage /c-mers, they eliminated 68% 
of the observed A-mers, reducing the total number to 
just 2.69 billion. Their genome assembly was based on 
this reduced set. 

More generally, while the number of true Ar-mers in a 
genome sequence is at most the genome length, G (or 
less in practice, due to repeats), the number of spurious 
/c-mers grows almost linearly with sequencing depth. To 
illustrate this, if we assume a uniform error rate a per 
nucleotide, then the expected number of spurious k- 
mers at sequence coverage C is GC '^'p 1 (1 — (1 — a) k ), 
where / is the length of sequence reads. (This calcula- 
tion ignores the rare events in which an identical 
sequencing error occurs more than once, and that error 
rates are typically highest near the ends of reads.) Then 
for example, at an error rate of 1% per base, read length 
of 100 bp, and k = 31, the number of spurious /c-mers 
would exceed the genome length G at just 5.33-fold 
coverage. 

However, even the seemingly simple goal of eliminat- 
ing singleton, or low coverage, Ar-mers is computation- 
ally demanding in practice, since we do not know a 
priori which /r-mers have low coverage. An obvious 
approach would be to simply load all observed /r-mers 
into a hash table while counting the number of 



occurrences of each. But this task alone can easily over- 
whelm the memory of standard high performance 
machines. 

The goal then is to implement a method for identify- 
ing unique /c-mers (or more generally, /r-mers that occur 
< n times), that makes highly efficient use of memory 
while providing efficient storage of Ar-mers with fast 
insertion and query times. The problem of counting the 
number of distinct Zr-mers is much easier if we are will- 
ing to settle for an approximate answer that works with 
high probability [15]. 

Here, we describe an approach to solving this problem 
by storing an implicit and highly compact representation 
of the observed /c-mers, known as a Bloom filter. A 
reference implementation, implemented in a C++ pro- 
gram called BFCounter, is freely available. We show 
empirical results of applying this method to published 
sequencing data. We also discuss possible extensions 
and further applications of the method. 

Results and Discussion 

The Bloom Filter 

The Bloom filter is a probabilistic data structure sup- 
porting dynamic set membership queries with false posi- 
tives [16]. It allows us to identify in an extremely 
compact way all /c-mers that are present more than 
once in a data set, while allowing a low rate of false 
positives. Bloom filters have been used widely in com- 
puting applications, but to date rarely in bioinformatics, 
but see [14,17,18]. 

The essential idea is illustrated in Figure 1. The Bloom 
filter is a bit array B, initialized to be 0 at every position. 
We also define a set of d hash functions, hi, h^, 
where each hash function maps a given k-mer x to a 
location in B. 
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Figure 1 Bloom filter example. An example of a Bloom filter with 
three hash functions. The k-mers a and b have been inserted, but c 
and d have not. The three hash functions are represented with 
arrows, and the bits corresponding to the hashes for a and b have 
been set to 1. The Bloom filter indicates correctly that k-mer c has 
not been inserted since not all of its bits are set to 1. However, k- 
mer d is an example of a false positive: it has not been inserted, but 
since its bits were set to 1 by the insertion of a and b, the Bloom 
filter falsely reports that d has been seen already. 
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In order to insert a k-mer x into the Bloom filter, we 
set all of the d corresponding locations in B to be 1; 
that is, we set B[hi{x)] = 1 for i = 1, d. Then, to 
determine whether a /c-mer y has been inserted, we sim- 
ply check whether each of the corresponding hash posi- 
tions is 1: i.e., whether B[h t (y)] are all set to 1 for i = 1, 
d. If this is the case, then we infer that y has probably 
been seen before. By construction, this procedure cor- 
rectly identifies every Ar-mer that is present more than 
once in the data; however, the cost of very efficient 
memory usage is that we accept a low rate of false posi- 
tives in which we infer that y has been seen previously, 
but in fact it has not. 

The Bloom filter has a tradeoff between memory 
usage (i.e., the number of bits used) and the false posi- 
tive rate. When storing n A-mers in a Bloom filter of m 
bits, and using d hash functions, the false positive rate is 
approximately (l — e~ d ™) d - Given n and m, the optimal 
number of hash functions that minimizes the false posi- 
tive ratio is d & — ln2[19]. In practice we may have a 
rough idea in advance about «, the number of /c-mers, 
and we can select m as a fixed multiple of n. For exam- 
ple using m = 8 • n (which corresponds to storing one 
byte per /c-mer), and d = 5 gives a false positive ratio of 
2.16%. Many variations and improvements have been 
proposed for Bloom filters [20,21]; or see [19] for a 
survey. 

Storing and counting /c-mers using the Bloom Filter 

To count all non-unique A-mers we use a Bloom filter B 
and a simple hash table T to store A-mers. The Bloom 
filter keeps track of k-mers we have encountered so far 
and acts as a "staging area", while the hash table stores 
all the k-mers seen at least twice so far. The idea is to 
use the memory-efficient Bloom filter to store implicitly 
all k-mers seen so far, while only inserting non-unique 
/c-mers into the hash table. 

Initially both the Bloom filter and the hash table are 
empty. All /c-mers are generated sequentially from the 
sequencing reads. Note that in most applications we do 
not need to distinguish between a k-mer and its reverse 
complement sequence. Thus, as we read in each /c-mer 
we also consider the reverse complement of that A-mer 
and then work with whichever of the two versions is 
lexicographically smaller (we refer to the smaller 
sequence as the "canonical k-mer"). 

For each k-mer, x, we check if x is in the Bloom filter 
B. If it is not in B then we update the appropriate bits 
in B to indicate that it has now been observed. If X is in 
B, then we check if it is in T, and if not, we add it to T. 

This scheme guarantees that all /c-mers with a cover- 
age of 2 or more are inserted into T. However a small 
proportion of unique /c-mers will be inserted into T due 



to false positive queries to B. After the first pass through 
the sequence data, one can re-iterate over the sequence 
data to obtain exact counts of the A-mers in T and then 
simply delete all unique A-mers. The time spent on the 
second round is at most 50% of the total time, and 
tends to be less since hash table lookups are generally 
faster than insertions. A detailed pseudocode is given in 
Figure 2. 

It is also possible to obtain approximate k-mer counts 
by iterating only once over the sequence reads. In this 
case we record a coverage count of 2 when first inserting 
a A-mer into the hash table T, and subsequently incre- 
ment the counter for each additional observation of this 
k-mer. This means that the coverage counts for some ti- 
mers are 1 higher than the true value, and some /r-mers 
in Tare in fact false positives (i.e., present only once). 

Higher Coverage Cutoffs 

For some applications a higher coverage cutoff may be 
required to either filter out sequencing errors or to sim- 
ply extract sequences of interest. The algorithm can be 
extended to use counting Bloom filters, where each bit 
in the bit array is now replaced with a counter that uses 
only a small number of bits. If the desired minimum 
coverage is c we use an array of m riog 2 (c)_L-bit coun- 
ters. The counting Bloom filter was introduced by [20] 
to allow for deletions, but here we use the counts 
directly. 

To check if a k-mer should be inserted into the hash 
table T we look to see if all of B[hi(x)] are equal to c - 
1. Otherwise we insert it into the Bloom filter. When 
inserting a k-mer x, we set 

B[hi(x)] +- mm{B[hi{x)] + l,c- 1} 

for i = 1, d. Note that for a A-mer x, mm{B[hi(x)]\i 
= 1, d] gives an upper bound on the number of 

/ ^ 

Algorithm 1 Bloom filter fc-mer counting algorithm 

1: B empty Bloom filter of size m 

2: T <- hash table 

3: for all reads s do 

4: for all h-meis x in s do 

5: x rep mm(x, revcomp(x)) //x rep is the canonical k-mer for x 

6: if x Tep 6 B then 

7: if x. rep ^ T then 

8: T[x rep ] <r- 0 

9: else 

10: add x rep to B 

11: for all reads s do 

12: for all k-mers x in s do 

13: Xrep 4~ min(.r, revcompfj:)) 

14: if x rep £- T then 

15: T[x„ p ] <- T[x„ p ] + 1 

16: for all x £ T do 
17: if T[x] = 1 then 
IS: remove x from T 

Figure 2 Algorithm pseudocode. A pseudocode for the Bloom 
filter /c-mer counting algorithm. 
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occurrences of x so far. Of course the basic version sim- 
ply corresponds to the case of c = 2. 

Parallelizability 

The algorithm is presented above as a standard single 
processor program and our current implementation is 
not multi-threaded. 

Nonetheless it would be possible to speed up the 
operations using multiple cores with lock-free data 
structures. This would require a non-blocking imple- 
mentation of the hash table [22] and a modification to 
the Bloom filter. The bit array in the Bloom filter is 
implemented as an array of word-sized integers, usually 
32 or 64 bits. To avoid accidental collisions where two 
bit locations in the same word are updated, one can use 
"compare-and-swap" (CAS) operations on words to 
ensure atomic updates of each bit independently. 

Since the role of the Bloom filter is to keep track of k- 
mers seen previously, this scheme could plausibly fail in 
the unlikely event that two occurrences of the same k- 
mer are inserted into the Bloom filter simultaneously by 
different threads. In this case the two threads would both 
query the Bloom filter for a ir-mer, x, and after both 
receive a negative answer the two threads would insert x 
simultaneously. If x occurs exactly twice in the data set 
then we would fail to record it in the hash table and get a 
false negative, although this type of false negative seems 
unlikely to be a serious concern in practice. However this 
can be fixed by extending the Bloom filter data structure 
to return the number of bits set to 1 when querying, and 
the number of bits changed from 0 to 1 when inserting. 
This makes insertion atomic, each thread can then deter- 
mine when inserting a new k-mer into the Bloom filter 
whether any other threads were inserting the same /c-mer 
simultaneously by comparing the number of bits changed 
from 0 to 1. If the two numbers do not match, we can 
infer that some other thread had already inserted the k- 
mer into the Bloom filter and proceed with inserting the 
/c-mer into the hash table. 

Implementation 

We implemented this algorithm in a program called 
BFCounter in C++, available from http://pritch.bsd.uchi- 
cago.edu/bfcounter.html The source code is licenced 
under a GPL licence. For the implementation we used 
the Google sparsehash library and a Bloom filter library 
by A. Partow http://www.partow.net/programming/hash- 
functions/index.html. We store a 1-byte counter for 
each k-mev and by default A-mers take 8-bytes of mem- 
ory with a maximum k of 31, although if desired, larger 
/c-mers can be specified at compile time. We require the 
user to specify an estimate for the number of A-mers in 
the sequencing data and use a Bloom filter with 4 times 
as many bits as the expected number of £-mers this 



corresponds to a memory usage of 4-bits per /c-mer and 
the optimal number of hash functions functions for the 
Bloom filter is d = 3. 

Example data sets 

To illustrate the performance of the new method, we 
describe the analysis of two data sets of sequencing 
reads from human genomic DNA. The first data set 
consists of 7.5 M 100 bp paired-end reads from the Illu- 
mina platform that mapped to Chromosome 21. These 
data, from HapMap individual NA19240, are available 
from Illumina at http://www.illumina.com/truseq/tru_re- 
sources/datasets.ilmn. This data set corresponds to 
approximately 32-fold coverage of Chromosome 21, a 
coverage-level that is typical of many contemporary 
sequencing studies. Since the reads have already been 
mapped to a genome this likely represents a cleaner 
data set (i.e., with fewer errors and lower repeat con- 
tent) than we would expect to get from unprocessed 
sequence data. 

The second data set consists of genome-wide 
sequence data from the 1000 Genomes Project Pilot II 
study [23]. Individual NA19240 was sequenced at 40- 
fold coverage, using 2.66 billion 36 bp paired-end Illu- 
mina reads. The data were filtered to remove sequences 
with low quality scores and missing basecalls; they are 
available at ftp://ftp.1000genomes.ebi.ac.uk/voll/ftp/ 
data/NA19240/sequence_read/. 

Our first application is to the 32-fold sequence data 
from Chromosome 21 data. We collected all A-mers 
from the sequencing reads, using k = 31. Figure 3 shows 
the distribution of the number of times each /c-mer is 
seen in the input data. Out of 80.4M observed /r-mers, 
slightly more than half (48. 7M) are observed only a sin- 
gle time. The vast majority of these singleton /c-mers 
(99.87%) are not found in the reference genome and 
hence are most likely due to sequencing errors, thus 
supporting the approach of discarding or correcting 
these unique A-mers. 

Figure 4 illustrates how the total number of Aimers, 
and the number of nonunique Aimers increases with 
sequencing depth for this data set. For the unfiltered k- 
mers we see the same behavior with increasing coverage 
as expected from the Introduction: namely, the total 
number of A-mers found increases approximately line- 
arly for coverage levels greater than about 5X. This 
increase is almost completely due to the increase in 
unique /c-mers that contain errors. In contrast, the 
number of non-unique /r-mers is only slightly more 
than the expected number based on the number of dis- 
tinct A-mers in the hgl8 genome sequence from Chro- 
mosome 21. 

To evaluate the computational performance of 
BFCounter we compared it to Jellyfish [8] and to a naive 
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Number of 31-mers 




Fold coverage 

Figure 3 Number of fr-mers The plot shows the number of 
distinct k-mers found in the sequencing data from chr21 at 
different coverage levels, based on random subsampling of the 
data. The total number of distinct k-mers in the hg18 genome 
sequence of chr21 is 32.5 million /c-mers. Unfiltered, the number of 
/c-mers found increases at a steady rate after 5-fold coverage. When 
unique k-mers are removed, the number of filtered k-mers 
approaches the ideal number at around 7-fold coverage and the 
rate of increase is significantly reduced. 



31-mer count distribution on Chromosome 21 
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Figure 4 k-mer distribution. Distribution of coverage levels for k- 
mers in the sequence reads from chromosome 21. There is a clear 
distinction between the coverage levels of the 31. 7M observed k- 
mers that are found in the hg 1 8 reference genome sequence 
compared to the 48.7M k-mers that are not in hg18. Of the k-mers 
not found in hg18, 44.5M or 99.87%, are observed only once, and 
are likely sequencing errors. A small fraction of /c-mers that do not 
match hg18 are observed many times in the data; these likely 
represent SNP differences between the sequenced individual and 
hg18 and would be retained by the Bloom filter. 



/c-mer counting program without any filtering. All com- 
parisons were done on a 64-bit x86 Intel Xeon machine 
with 8 cores at 2.4 GHz and 144 GB of memory running 
Linux kernel version 2.6.18. The disks were all from 
shared network through Lustre. All time measurements 
were done with the time unix command and memory 
usage was measured using strace. 

The naive version simply stores all Aimers explicitly in 
a Google sparsehash hash table and skips the filtering 
step. Jellyfish is a sophisticated k-mer counting program 
that features support for multicore machines. Further- 
more Jellyfish stores an implicit representation of /c-mers 
in a hash table to save memory. The authors of the Jelly- 
fish program recently showed that their method provides 
large memory savings compared to other traditional 
methods for /c-mer counting. Jellyfish requires us to pre- 
specify the size of the hash table to use; if the hash table 
fills up, the results are written to disk and merged later. 
To compare the programs we found the minimum size 
so that Jellyfish could keep all /r-mers in memory. For the 
second data set Jellyfish could not fit all /c-mers in mem- 
ory with default parameters. To fit the hash table in 
memory we needed to set the number of reprobes to 255 
by running Jellyfish with the -p 255 option. For timing 
comparisons we run Jellyfish in serial mode. 



The increase in the number of /r-mers affects the 
memory consumption directly. Figure 5 plots the mem- 
ory requirements of BFCounter, Jellyfish and the naive 
version. The increase in memory levels off for BFCoun- 
ter after about 7-fold coverage, whereas for the naive 
version and Jellyfish the memory increases steadily as 
the number of /r-mers grows. 

Table 1 presents the memory and time requirements for 
the three methods when applied to the second data set 
(40-fold coverage of a human genome with 36 bp reads). 
For this analysis we set the /r-mer length k = 25, which 
strikes a balance between the number of /r-mers produced 
by each read, here 11, and the specificity of the /r-mers. 
Although for this data set the average basepair coverage is 
fixed, the /c-mer coverage decreases with k. On the other 
hand increasing k gives more observed /r-mers, since 
sequencing errors can generate up to k unique /c-mers. 

There are 12.18 billion /r-mers present in the sequen- 
cing reads, of which 9.35 billion are unique and 2.83 bil- 
lion have coverage of two or greater (compared to 2.37 
billion distinct 25-mers in the hgl8 genome sequence). 
When BFCounter was run, about 0.5 billion of the 
unique /c-mers were stored in the hash table after the 
first phase which corresponds to a 5.3% false positive 
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usage. The memory usage of the three 



programs at different coverage levels (Chromosome 21 data). Note 
that Jellyfish and the naive counter are storing all /c-mers while 
BFCounter filters out most unique /<-mers without storing them 
explicitly in memory. The memory usage of BFCounter and the 
naive version roughly mimic the shape for the number of filtered k- 
mers in Figure 3. The discrete jumps in the memory usage of 
Jellyfish are due to implementation details as the size of the hash 
table has to be a power of 2. 



rate for the Bloom filter. Thus, BFCounter stored 27% of 
the original k-mers after the first pass, and this was cut 
to 23% after false positives were removed. 

As may be seen from the table, BFCounter uses con- 
siderably less memory than either Jellyfish or the naive 
hash table method. Indeed the naive method ran out of 
memory and was unable to complete. However, 
BFCounter takes approximately three times longer to 
run as Jellyfish. Part of the difference in speed is due to 
BFCounter taking a second pass through the data to 
obtain exact k-mer counts (which may not be essential 
for all applications). 



assembly from short read sequence data. However, in 
current data sets, it is frequently the case that more than 
half of the reads contain errors and are observed just a 
single time. Since these error-containing £-mers are so 
numerous, they can overwhelm the memory capacity of 
available high-performance machines, and they increase 
the computational complexity of downstream analysis. 

In this paper, we describe a straightforward applica- 
tion of the Bloom filter data structure to help identify 
and store the reads that are present more than once (or 
more than n times) in a data set, and are therefore far 
more likely to be correct. By doing so, we achieve 
greatly reduced memory requirements compared to a 
naive but memory-efficient hash table method, as well 
as to Jellyfish (which has been highly optimized for 
memory efficiency, while storing all /c-mers). For many 
applications, it may be sufficient to simply ignore the 
unique £-mers (as was done for the panda genome); 
alternatively, users may prefer to "correct" reads by 
comparing unique k-mevs to common /r-mers [11-14]. 
In summary, the approach presented here could be 
straightforwardly incorporated into a wide variety of 
algorithms that start by counting /r-mers. 

Our method trades off reduced memory usage for an 
increase in processing time. In many cases the memory 
limitation is a hard threshold and the counting of £-mers 
is only run once and a fixed set of /c-mers is stored for 
future computation. For genome assembly methods the 
construction of de Bruijn graphs dominates memory con- 
sumption [7] and the time for completion can be several 
days [13], depending on the amount of postprocessing. 
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Conclusions 

Counting k-mers from sequencing data is an essential 
component of many recent methods for genome 

Table 1 Memory usage and Time for whole genome data 



Program 


Time (hrs) 


Memory (GB) 


BFCounter 


23.82 


42 


Jellyfish 


8.03 


71 


Naive* 


>26.38 


>128 



Memory and time usage for the 1000 Genomes data set (40-fold coverage of 
individual NA19240). The naive version ran out of memory after processing 
84.7% of the reads. 
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